Alfred: A System for Prompted Weak Supervision
Alfred is the first system for programmatic weak supervision (PWS) that
creates training data for machine learning by prompting. In contrast to typical
PWS systems where weak supervision sources are programs coded by experts,
Alfred enables users to encode their subject matter expertise via natural
language prompts for language and vision-language models. Alfred provides a
simple Python interface for the key steps of this emerging paradigm, with a
high-throughput backend for large-scale data labeling. Users can quickly
create, evaluate, and refine their prompt-based weak supervision sources; map
the results to weak labels; and resolve their disagreements with a label model.
Alfred enables a seamless local development experience backed by models served
from self-managed computing clusters. It automatically optimizes prompt
execution with batching mechanisms. We find that this optimization
improves query throughput by 2.9x versus a naive approach. We present two
example use cases demonstrating Alfred on YouTube comment spam detection and
pet breed classification. Alfred is open source, available at
https://github.com/BatsResearch/alfred.
Comment: ACL 2023 System Demonstration Track
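The last two steps described in this abstract (mapping prompt responses to weak labels, then resolving disagreements) can be illustrated with a minimal sketch. This is not Alfred's API: `map_response`, `majority_vote`, and the `ABSTAIN` convention are hypothetical names chosen for illustration, and a simple majority vote stands in for a learned label model.

```python
from collections import Counter

ABSTAIN = -1  # assumed convention: a prompt with no usable answer abstains


def map_response(response: str, mapping: dict[str, int]) -> int:
    """Map a raw model response to a weak label; unknown text abstains."""
    return mapping.get(response.strip().lower(), ABSTAIN)


def majority_vote(votes: list[int]) -> int:
    """Resolve disagreements among weak labels, ignoring abstentions."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    # Abstain when the two most frequent labels are tied.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


# Hypothetical responses from three prompts for one YouTube comment.
spam_mapping = {"yes": 1, "no": 0}
responses = ["Yes", "yes ", "no"]
weak_labels = [map_response(r, spam_mapping) for r in responses]
final_label = majority_vote(weak_labels)
```

Majority vote treats every prompt as equally reliable; a label model would instead estimate each source's accuracy before combining the votes.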
Follow-Up Differential Descriptions: Language Models Resolve Ambiguities for Image Classification
A promising approach for improving the performance of vision-language models
like CLIP for image classification is to extend the class descriptions (i.e.,
prompts) with related attributes, e.g., using brown sparrow instead of sparrow.
However, current zero-shot methods select a subset of attributes regardless of
commonalities between the target classes, potentially providing no
information that helps distinguish between them. For instance,
they may use color instead of bill shape to distinguish between sparrows and
wrens, which are both brown. We propose Follow-up Differential Descriptions
(FuDD), a zero-shot approach that tailors the class descriptions to each
dataset and leads to additional attributes that better differentiate the target
classes. FuDD first identifies the ambiguous classes for each image, and then
uses a Large Language Model (LLM) to generate new class descriptions that
differentiate between them. The new class descriptions resolve the initial
ambiguity and help predict the correct label. In our experiments, FuDD
consistently outperforms generic description ensembles and naive LLM-generated
descriptions on 12 datasets. We show that differential descriptions are an
effective tool to resolve class ambiguities, which otherwise significantly
degrade performance. We also show that high-quality natural language class
descriptions produced by FuDD result in comparable performance to few-shot
adaptation methods.
Comment: Code: https://github.com/BatsResearch/fud
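The two-stage idea in this abstract (find the ambiguous classes for an image, then ask an LLM for descriptions that separate them) can be sketched as follows. This is an illustration under stated assumptions, not the FuDD implementation: taking the top-k classes by similarity score as "ambiguous", the function names, and the query wording are all hypothetical.

```python
from itertools import combinations


def ambiguous_classes(scores: dict[str, float], k: int = 3) -> list[str]:
    """Return the k classes with the highest image-text similarity.

    Assumption: the top-k classes are treated as the ambiguous set.
    """
    return sorted(scores, key=scores.get, reverse=True)[:k]


def differential_queries(classes: list[str]) -> list[str]:
    """Build one LLM query per pair of ambiguous classes (hypothetical wording)."""
    return [
        f"What visual features distinguish a {a} from a {b}?"
        for a, b in combinations(classes, 2)
    ]


# Hypothetical similarity scores for a single image.
scores = {"sparrow": 0.31, "wren": 0.30, "eagle": 0.05}
confusable = ambiguous_classes(scores, k=2)
queries = differential_queries(confusable)
```

The LLM answers to such queries would then replace the generic class descriptions when the image is classified a second time.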
Zero-Shot Learning with Common Sense Knowledge Graphs
Zero-shot learning relies on semantic class representations such as
hand-engineered attributes or learned embeddings to predict classes without any
labeled examples. We propose to learn class representations from common sense
knowledge graphs. Common sense knowledge graphs are an untapped source of
explicit high-level knowledge that requires little human effort to apply to a
range of tasks. To capture the knowledge in the graph, we introduce ZSL-KG, a
general-purpose framework with a novel transformer graph convolutional network
(TrGCN) for generating class representations. Our proposed TrGCN architecture
computes non-linear combinations of the node neighbourhood and shows
improvements on zero-shot learning tasks in language and vision. Our results
show ZSL-KG outperforms the best performing graph-based zero-shot learning
framework by an average of 2.1 accuracy points with improvements as high as 3.4
accuracy points. Our ablation study on ZSL-KG with alternate graph neural
networks shows that our TrGCN adds up to 1.2 accuracy points improvement on
these tasks.
Tight Lower Bounds on Worst-Case Guarantees for Zero-Shot Learning with Attributes
We develop a rigorous mathematical analysis of zero-shot learning with
attributes. In this setting, the goal is to label novel classes with no
training data, only detectors for attributes and a description of how those
attributes are correlated with the target classes, called the class-attribute
matrix. We develop the first non-trivial lower bound on the worst-case error of
the best map from attributes to classes for this setting, even with perfect
attribute detectors. The lower bound characterizes the theoretical intrinsic
difficulty of the zero-shot problem based on the available information -- the
class-attribute matrix -- and the bound is practically computable from it. Our
lower bound is tight, as we show that we can always find a randomized map from
attributes to classes whose expected error is upper bounded by the value of the
lower bound. We show that our analysis can be predictive of how standard
zero-shot methods behave in practice, including which classes will likely be
confused with others.
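To make the setting concrete, here is a minimal sketch of a map from attributes to classes driven by a class-attribute matrix. It is not the paper's bound or its randomized map: the nearest-row rule, the binary matrix, and the class names are assumptions made for illustration.

```python
def predict_class(attributes: list[int], class_attribute_matrix: dict[str, list[int]]) -> str:
    """Return the class whose attribute row is closest (L1 distance) to the detections."""
    def distance(row: list[int]) -> int:
        return sum(abs(a - r) for a, r in zip(attributes, row))

    # Ties are broken by dictionary order, which is one source of class confusion.
    return min(class_attribute_matrix, key=lambda c: distance(class_attribute_matrix[c]))


# Hypothetical matrix; columns: has_stripes, has_hooves, lives_in_water.
matrix = {
    "zebra": [1, 1, 0],
    "horse": [0, 1, 0],
    "whale": [0, 0, 1],
}
prediction = predict_class([1, 1, 0], matrix)
```

In this toy matrix, "zebra" and "horse" differ in a single attribute, so one detector error is enough to confuse them. This is the kind of confusion the abstract says its analysis can predict from the matrix alone.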
Does CLIP Bind Concepts? Probing Compositionality in Large Image Models
Large-scale neural network models combining text and images have made
incredible progress in recent years. However, it remains an open question to
what extent such models encode compositional representations of the concepts
over which they operate, such as correctly identifying "red cube" by
reasoning over the constituents "red" and "cube". In this work, we focus on
the ability of a large pretrained vision and language model (CLIP) to encode
compositional concepts and to bind variables in a structure-sensitive way
(e.g., differentiating "cube behind sphere" from "sphere behind cube"). In
order to inspect the performance of CLIP, we compare several architectures from
research on compositional distributional semantics models (CDSMs), a line of
research that attempts to implement traditional compositional linguistic
structures within embedding spaces. We find that CLIP can compose concepts in a
single-object setting, but in situations where concept binding is needed,
performance drops dramatically. At the same time, CDSMs also perform poorly,
with best performance at chance level.